Why DUSt3R?
There are many methods that address this family of problems in 3D vision, as shown in the image:
We are interested in mapping images to a dense 3D point cloud, i.e. dense 3D reconstruction. Many RGB-to-3D solutions approach this by building a differentiable SfM pipeline, but they require camera parameters as input and then output a depthmap together with a relative camera pose.
DUSt3R, on the other hand, outputs pointmaps: dense 2D fields of 3D points. Pointmaps handle camera poses implicitly, without requiring any camera intrinsic parameters.
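To make the pointmap idea concrete, here is a minimal NumPy sketch (not from DUSt3R's codebase): a pointmap stores one 3D point per pixel, and with known intrinsics it is just the back-projection of a depthmap, whereas DUSt3R regresses it directly.

```python
import numpy as np

H, W = 224, 224

# With known intrinsics K and a depthmap D, a pointmap is the
# back-projection of every pixel; X.shape == (H, W, 3), one 3D point
# per pixel, expressed in the camera's coordinate frame.
def depth_to_pointmap(depth, K):
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # (H, W, 3) camera rays
    return rays * depth[..., None]           # scale each ray by its depth

K = np.array([[200.0, 0.0, W / 2],           # toy intrinsics
              [0.0, 200.0, H / 2],
              [0.0,   0.0,   1.0]])
depth = np.ones((H, W))                       # toy: unit-depth plane
X = depth_to_pointmap(depth, K)
print(X.shape)  # (224, 224, 3)
```

DUSt3R skips the `K` and `depth` inputs entirely and predicts `X` for both views in the frame of the first camera.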
Architecture of DUSt3R
Two views of a scene are first encoded in a Siamese manner with a shared ViT encoder. The resulting token representations are then passed to two transformer decoders that constantly exchange information via cross-attention. Finally, two regression heads output the two corresponding pointmaps and associated confidence maps. Importantly, the two pointmaps are expressed in the same coordinate frame of the first image. The network is trained using a simple regression loss.
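The data flow above can be sketched at the shape level (toy dimensions; every function name here is a hypothetical stand-in, not the repository's API): one shared encoder, two decoders that exchange tokens, and dense per-pixel outputs in image 1's frame.

```python
import numpy as np

H = W = 224
patch = 16
n_tokens = (H // patch) * (W // patch)    # tokens per image
dim = 32                                   # toy embedding width

rng = np.random.default_rng(0)
enc_w = rng.standard_normal((dim, dim))    # ONE weight set: Siamese encoder

def encode(tokens):
    return tokens @ enc_w                  # both views share these weights

def decode(own, other):
    # Stand-in for a decoder block that cross-attends to the other branch;
    # here just a mix of the two token sets to show the information flow.
    return 0.5 * own + 0.5 * other.mean(axis=0, keepdims=True)

img1_tokens = rng.standard_normal((n_tokens, dim))
img2_tokens = rng.standard_normal((n_tokens, dim))
f1, f2 = encode(img1_tokens), encode(img2_tokens)
d1, d2 = decode(f1, f2), decode(f2, f1)    # branches exchange information
# Each regression head outputs a dense pointmap + confidence map, both
# expressed in the coordinate frame of the first image:
pointmap1, conf1 = np.zeros((H, W, 3)), np.ones((H, W))
pointmap2, conf2 = np.zeros((H, W, 3)), np.ones((H, W))
print(n_tokens, d1.shape)  # 196 (196, 32)
```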
Training DUSt3R
The authors train DUSt3R in 3 stages:
- step 1: train DUSt3R at 224 resolution, then
- step 2: train DUSt3R at 512 resolution, and then
- step 3: train DUSt3R at 512 resolution with a DPT head
Evaluation Metrics
- The authors compare poses estimated by DUSt3R against SOTA methods such as RelPose.
- They use Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) for each image pair to evaluate the relative pose error, and select a threshold τ = 15 to report RTA@15 and RRA@15.
- They also report mean Average Accuracy (mAA)@30, defined as the area under the curve accuracy of the angular differences at min(RRA@30, RTA@30).
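A minimal NumPy sketch of the RRA metric (my own illustration, not the evaluation code): the relative-rotation error is the geodesic angle between the ground-truth and predicted relative rotations, and RRA@τ is the fraction of pairs below the threshold. RTA is computed analogously from the angle between translation directions.

```python
import numpy as np

def rotation_angle_deg(R_gt, R_pred):
    # Geodesic distance between two rotation matrices:
    # angle = arccos((trace(R_gt^T R_pred) - 1) / 2)
    cos = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rra_at_tau(gt_rots, pred_rots, tau=15.0):
    # RRA@tau: fraction of image pairs whose relative-rotation error
    # falls below the threshold tau (in degrees).
    errs = [rotation_angle_deg(g, p) for g, p in zip(gt_rots, pred_rots)]
    return float(np.mean(np.array(errs) < tau))

I = np.eye(3)
a = np.radians(20.0)                       # a rotation 20 deg off about z
Rz20 = np.array([[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a),  np.cos(a), 0.0],
                 [0.0,        0.0,       1.0]])
print(rra_at_tau([I, I], [I, Rz20]))  # 0.5: one exact pair, one 20 deg off
```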
My attempts at training DUSt3R from scratch
I tried training DUSt3R from scratch on the CO3D dataset, but because of limited computational resources at hand I was not able to train the model fully.
I managed to train DUSt3R at 224 resolution for 5 epochs with a batch size of 2, on Google Colab with a T4 GPU.
At some point I stopped, since training DUSt3R to convergence is not feasible with the resources Google Colab provides.
#Clone Repo
%cd /PKN/study_projects/cvg_bern/task_2
!git clone --recursive https://github.com/naver/dust3r
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
#Setup Environment
%pip install -r requirements.txt
Let's create the dataset (CO3D)
# download and prepare the co3d subset
!mkdir -p /PKN/study_projects/cvg_bern/task_2/data/co3d_subset
%cd /PKN/study_projects/cvg_bern/task_2/data/co3d_subset
!git clone https://github.com/facebookresearch/co3d
%cd co3d
!python3 ./co3d/download_dataset.py --download_folder ../ --single_sequence_subset
# Creating image pairs for training
# they utilize off-the-shelf image retrieval and point matching algorithms to match and verify image pairs
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
!python3 datasets_preprocess/preprocess_co3d.py --co3d_dir /PKN/study_projects/cvg_bern/task_2/data/co3d_subset --output_dir /PKN/study_projects/cvg_bern/task_2/data/co3d_subset_processed --single_sequence_subset
# download the pretrained croco v2 checkpoint
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
!mkdir -p checkpoints/
!wget https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo_V2_ViTLarge_BaseDecoder.pth -P checkpoints/
# step 1 - train dust3r for 224 resolution
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
!torchrun --nproc_per_node=1 train.py \
--train_dataset "1000 @ Co3d(split='train', ROOT='/PKN/study_projects/cvg_bern/task_2/data/co3d_subset_processed', aug_crop=16, mask_bg='rand', resolution=224, transform=ColorJitter)" \
--test_dataset "100 @ Co3d(split='test', ROOT='/PKN/study_projects/cvg_bern/task_2/data/co3d_subset_processed', resolution=224, seed=777)" \
--model "AsymmetricCroCo3DStereo(pos_embed='RoPE100', img_size=(16, 16), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
--train_criterion "ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
--test_criterion "Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
--pretrained "checkpoints/CroCo_V2_ViTLarge_BaseDecoder.pth" \
--lr 0.0001 --min_lr 1e-06 --warmup_epochs 1 --epochs 10 --batch_size 2 --accum_iter 1 \
--save_freq 1 --keep_freq 5 --eval_freq 1 \
--output_dir "checkpoints/dust3r_demo_224"
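The `ConfLoss(Regr3D(L21, ...), alpha=0.2)` training criterion above is the paper's confidence-aware regression loss. Here is a minimal NumPy sketch of the idea (not the repository's implementation; the per-pointmap normalization by mean distance to the origin is simplified away):

```python
import numpy as np

def conf_loss(pred_pts, gt_pts, conf, alpha=0.2):
    # Pixelwise Euclidean regression error between predicted and
    # ground-truth 3D points.
    regr = np.linalg.norm(pred_pts - gt_pts, axis=-1)       # (H, W)
    # Confidence-weighted loss: confident pixels pay more for errors,
    # while -alpha * log(conf) penalizes the trivial "always unconfident"
    # solution. conf_mode=('exp', 1, inf) keeps conf >= 1, so log(conf) >= 0.
    return float(np.mean(conf * regr - alpha * np.log(conf)))

H = W = 4                                   # toy pointmaps
pred = np.zeros((H, W, 3))
gt = np.zeros((H, W, 3))
conf = np.ones((H, W))
print(conf_loss(pred, gt, conf))  # 0.0 for a perfect prediction at conf = 1
```

The network can lower the loss on hard pixels by predicting low confidence there, which is what produces the confidence maps used at inference.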
# Downloading already trained weights of 512dpt:
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
!mkdir -p ori_checkpoints/
!wget https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth -P /PKN/study_projects/cvg_bern/task_2/dust3r/ori_checkpoints/
# Downloading already trained weights of 224:
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
!mkdir -p ori_checkpoints/
!wget https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_224_linear.pth -P /PKN/study_projects/cvg_bern/task_2/dust3r/ori_checkpoints/
Running the experiment with the StructColorToaster scene
Using pretrained weights of DUSt3R 512 DPT
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
!python demo.py --weights /PKN/study_projects/cvg_bern/task_2/dust3r/ori_checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
Using the DUSt3R 224 weights I trained from scratch
%cd /PKN/study_projects/cvg_bern/task_2/dust3r
!python demo.py --weights /PKN/study_projects/cvg_bern/task_2/dust3r/checkpoints/checkpoint-best.pth
From the visualizations it is evident that training for just a few epochs and on just a few scenes is not enough for the model to generalize well.
Comparison with 3D Gaussian Splatting
Observations
Vanilla 3D Gaussian Splatting is able to learn the geometry and complex appearance of the scene.
On the other hand, DUSt3R also performs well and reconstructs the geometry without needing camera parameters as input.
We can conclude that combining the two methods (for example, using DUSt3R's pointmaps and poses to initialize Gaussian Splatting in place of an SfM pipeline) could be a game changer in 3D vision.